ASTR is an asymmetric RGB-X tracking architecture that pairs a multi-layer encoder with a single-layer decoder. It achieves strong tracking performance with fewer parameters and FLOPs than conventional dual-stream trackers, making it both efficient and accurate for RGB-Thermal, RGB-Event, and RGB-Depth tracking.
Fig 1. The proposed ASTR architecture with asymmetric encoder-decoder design
The decoder applies cross-attention between the search embeddings and modal-blended template embeddings. Modal blending combines the RGB and X modality features using learnable weights normalized by a softmax, producing N diverse blended representations.
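The blending and cross-attention steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`blend_templates`, `cross_attention`), the tensor shapes, the choice to concatenate the N blended templates along the token axis, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def blend_templates(rgb, x, logits):
    """Blend RGB and X template tokens into N representations.

    rgb, x : (T, D) template token embeddings for each modality.
    logits : (N, 2) learnable parameters (hypothetical shape); a softmax
             over the last axis gives one (w_rgb, w_x) pair per blend.
    Returns an array of shape (N, T, D).
    """
    w = softmax(logits, axis=-1)  # each row sums to 1
    return w[:, 0, None, None] * rgb + w[:, 1, None, None] * x


def cross_attention(search, templates):
    """Single-head cross-attention with search tokens as queries.

    search    : (S, D) search-region token embeddings.
    templates : (N, T, D) blended templates, flattened here to (N*T, D)
                keys/values (an assumption about how they are consumed).
    Learned projections are omitted to keep the sketch short.
    """
    d = search.shape[-1]
    kv = templates.reshape(-1, d)
    scores = search @ kv.T / np.sqrt(d)  # scaled dot-product scores
    attn = softmax(scores, axis=-1)      # each query row sums to 1
    return attn @ kv


rng = np.random.default_rng(0)
rgb = rng.normal(size=(4, 8))      # T=4 template tokens, D=8 channels
x = rng.normal(size=(4, 8))        # X modality (thermal, event, or depth)
logits = rng.normal(size=(3, 2))   # N=3 blends
search = rng.normal(size=(6, 8))   # S=6 search tokens

blended = blend_templates(rgb, x, logits)  # shape (3, 4, 8)
out = cross_attention(search, blended)     # shape (6, 8)
```

Because the weights pass through a softmax, each blend is a convex combination of the two modalities; different logit rows give differently weighted mixtures, which is one way to obtain the N diverse representations described above.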
ASTR outperforms state-of-the-art trackers on the LasHeR, VisEvent, RGBT234, and DepthTrack datasets, using up to 55.6% fewer FLOPs while maintaining competitive tracking accuracy.
Fig 2. ASTR vs. other RGB-X trackers on multiple datasets